Clustering dialects automatically: A mutual information approach
Abstract
Dialects can be categorized in many ways. Using external features, dialects may be grouped by the geographic location (e.g., Irish English), ethnic identity (e.g., AAVE), or social networks (e.g., Liberian Settler English) of their speakers. Using internal features, dialects may instead be grouped by shared features of pronunciation, vocabulary, or grammar. We explore quantitative approaches to see how similarly dialects cluster under these different methods. We describe a method of clustering dialects according to patterns of shared phonological features. While previous linguistic research has generally treated such phonological features as independent of each other, we examine their statistical co-variation. That is, we measure the degree to which variation in one feature predicts variation in each other feature, i.e., their Mutual Information (MI). As an example, we look at the degree to which we can predict whether a dialect will exhibit the cot/caught merger from knowing whether it vocalizes /r/ in the word barn. Within phonological theory these variables are independent of each other, but they do exhibit statistical dependence. To test our method, we explore a data set of 168 binary features describing the pronunciation of vowels and consonants by English speakers from 35 countries and regions. This is a subset of the data collected for the Handbook of Varieties of English (Schneider et al. 2005). These dialects are grouped according to patterns of shared features. The results of categorizing dialect varieties by binary pronunciation features are compared to traditional groupings based on external features. In many ways, the clusters produced by this method are similar to the traditional groupings. We also compare differences in clustering outcomes determined by phonological vs. morphosyntactic features, as well as differences that depend on the method of clustering.
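To make the notion of mutual information between two binary features concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the function name `mutual_information` and the toy feature vectors are assumptions made for illustration, and the values are invented rather than taken from the Handbook data. Each vector holds one binary feature (1 = present, 0 = absent) across a set of hypothetical dialects.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information, in bits, between two equal-length discrete sequences.

    MI(X; Y) = sum over (a, b) of p(a, b) * log2( p(a, b) / (p(a) * p(b)) ),
    with all probabilities estimated from observed counts.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(x)
    joint = Counter(zip(x, y))  # counts of each (a, b) pair
    px = Counter(x)             # marginal counts for x
    py = Counter(y)             # marginal counts for y
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p(a, b) / (p(a) * p(b)) simplifies to count * n / (px[a] * py[b])
        mi += p_ab * math.log2(count * n / (px[a] * py[b]))
    return mi


# Hypothetical toy data: one entry per dialect (invented, for illustration only).
cot_caught_merger = [1, 1, 1, 0, 0, 0, 1, 0]
r_vocalized_in_barn = [0, 0, 0, 1, 1, 1, 0, 1]   # perfectly anti-correlated here
unrelated_feature = [1, 0, 1, 0, 1, 0, 0, 1]

print(mutual_information(cot_caught_merger, r_vocalized_in_barn))
print(mutual_information(cot_caught_merger, unrelated_feature))
```

In this toy example the first pair is fully predictable from one another, so its MI equals the entropy of a balanced binary feature, while the second pair shows no dependence. With real data, MI values would typically fall between these extremes, and estimates from only 35 varieties would be noisy.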